27 May, 2021
Data (geo)science project
Exploring your data
Modelling
Wrangling
Each section:
5-10 minutes of introduction
5-10 minutes of live coding and questions
# Install PAGES from GitHub:
# install.packages("devtools")
devtools::install_github("MartinSchobben/PAGES", build_vignettes = TRUE)
This class is completely based on Hadley Wickham’s and Garrett Grolemund’s R4DS.
I have augmented the examples with cases from geology.
Wickham and Grolemund (2016)
The tidyverse: an opinionated collection of R packages designed for data science.
PAGES ships two lazy-loaded datasets: bonenburg (geochemistry) and kuhjoch (palynology).
Wickham and Grolemund (2016)
RStudio projects
A clear directory and file structure, with metadata to describe the data. Raw data should be read-only and backed up.
An R script with clear documentation of all steps involved.
Publish all aspects of this workflow along with your paper.
Pipe: %>%
kuhjoch_grps <- group_by(kuhjoch, type)  # intermediate object
kuhjoch_mean <- summarise(kuhjoch_grps, mean_count = mean(count))
kuhjoch_mean <- summarise(group_by(kuhjoch, type), mean_count = mean(count))  # nested calls
kuhjoch_mean <- group_by(kuhjoch, type) %>% summarise(mean_count = mean(count))  # pipe
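The three spellings above can be compared on a small made-up table. This is only a sketch: the tibble `pollen` is a hypothetical stand-in for kuhjoch, its values are invented, and summarising with `mean()` is an assumption about what the summary is meant to compute.

```r
library(dplyr)

# Hypothetical stand-in for kuhjoch: counts per palynomorph type
pollen <- tibble(
  type  = c("spore", "spore", "pollen", "pollen"),
  count = c(10, 20, 4, 6)
)

# 1. Intermediate object
pollen_grps <- group_by(pollen, type)
mean_steps  <- summarise(pollen_grps, mean_count = mean(count))

# 2. Nested calls (read inside-out)
mean_nested <- summarise(group_by(pollen, type), mean_count = mean(count))

# 3. Pipe (read left to right)
mean_piped <- pollen %>%
  group_by(type) %>%
  summarise(mean_count = mean(count))

# All three approaches should give the same table
isTRUE(all.equal(mean_steps, mean_nested)) && isTRUE(all.equal(mean_nested, mean_piped))
```

The pipe version reads in the order the operations happen, which is the main argument for using it in longer chains.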
Wickham and Grolemund (2016)
“The simple graph has brought more information to the data analyst’s mind than any other device.” — John Tukey
ggplot(data = <DATA>) +
  <GEOM_FUNCTION>(
    mapping = aes(<MAPPINGS>),
    stat = <STAT>,
    position = <POSITION>,
    orientation = <ORIENTATION>
  ) +
  <FACET_FUNCTION>
Grammar of graphics (Wilkinson et al. 2005)
Scatterplot: maps each observation to a horizontal and vertical position and the geom represents this as a point
ggplot(data = bonenburg) + geom_point(mapping = aes(x = del13Ctoc, y = Height))
The colour, shape and linetype aesthetics can also map additional variables. Here the stratigraphic unit (a categorical variable) is mapped to colour.
ggplot(data = bonenburg) + geom_point(mapping = aes(x = del13Ctoc, y = Height, colour = Strat))
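A minimal sketch of the difference between mapping an aesthetic to a variable (inside `aes()`) and setting it to a fixed value (outside `aes()`). The data frame `section` is a hypothetical stand-in for bonenburg with invented values; only the column names follow the slides.

```r
library(ggplot2)

# Hypothetical stand-in for bonenburg (values invented for illustration)
section <- data.frame(
  Height    = c(1, 2, 3, 4, 5, 6),
  del13Ctoc = c(-27.5, -27.1, -26.8, -25.9, -26.2, -27.0),
  Strat     = c("Triassic", "Triassic", "Triassic",
                "Jurassic", "Jurassic", "Jurassic")
)

# Mapping inside aes(): colour varies with the data (one colour per unit)
p_mapped <- ggplot(data = section) +
  geom_point(mapping = aes(x = del13Ctoc, y = Height, colour = Strat))

# Setting outside aes(): every point gets the same fixed colour
p_fixed <- ggplot(data = section) +
  geom_point(mapping = aes(x = del13Ctoc, y = Height), colour = "steelblue")

# layer_data() shows the colour each point ends up with
unique(layer_data(p_mapped)$colour)
unique(layer_data(p_fixed)$colour)
```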
The stat argument applies an internal (statistical) transformation to <DATA> before it is drawn.
ggplot(data = <DATA>) +
  <GEOM_FUNCTION>(
    mapping = aes(<MAPPINGS>),
    stat = <STAT>,
    position = <POSITION>,
    orientation = <ORIENTATION>
  ) +
  <FACET_FUNCTION>
Typical questions
Wickham and Grolemund (2016)
ggplot(data = bonenburg) + geom_boxplot(mapping = aes(y = Strat, x = del13Ctoc), stat = "boxplot")
ggplot(data = bonenburg) + geom_boxplot(mapping = aes(y = reorder(Strat, Height), x = del13Ctoc))
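To see what the boxplot stat actually computes, the transformed data can be pulled out of a plot with `layer_data()`. This is a sketch on invented numbers; the data frame `section` is hypothetical, and the boxplot is drawn vertically here (unlike the horizontal boxes above) so that the computed columns carry their default names.

```r
library(ggplot2)

# Hypothetical isotope values for two stratigraphic units (invented numbers)
section <- data.frame(
  Strat     = rep(c("Jurassic", "Triassic"), each = 5),
  del13Ctoc = c(-26.0, -25.5, -26.5, -25.0, -27.0,
                -28.0, -27.5, -27.0, -28.5, -29.0)
)

# Vertical boxplot, so the computed columns are named ymin, lower, middle, ...
p <- ggplot(data = section) +
  geom_boxplot(mapping = aes(x = Strat, y = del13Ctoc))

# layer_data() returns the data after the stat has been applied:
# one row of summary statistics per box instead of ten raw observations
box_stats <- layer_data(p)
box_stats[, c("ymin", "lower", "middle", "upper", "ymax")]

# The "middle" column is the median of each group
group_medians <- tapply(section$del13Ctoc, section$Strat, median)
group_medians
```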
Facets split the data into panels according to a categorical variable.
ggplot(data = bonenburg_long) + geom_point(mapping = aes(x = value, y = Height), na.rm = TRUE) + facet_grid(cols = vars(measurement), scales = "free_x")
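The same faceting pattern on a tiny long-format table, as a sketch. The data frame `section_long` is hypothetical and its values are invented; it only mimics the `Height`, `measurement` and `value` columns used above.

```r
library(ggplot2)

# Hypothetical long-format data: one row per (Height, measurement) pair
section_long <- data.frame(
  Height      = rep(1:4, times = 2),
  measurement = rep(c("del13Ctoc", "TOCcfb"), each = 4),
  value       = c(-27.5, -27.0, -26.1, -26.8, 1.2, 0.9, NA, 2.4)
)

p <- ggplot(data = section_long) +
  geom_point(mapping = aes(x = value, y = Height), na.rm = TRUE) +
  # one column of panels per measurement, each with its own x axis
  facet_grid(cols = vars(measurement), scales = "free_x")

# Each measurement is assigned to its own panel
panels <- unique(layer_data(p)$PANEL)
length(panels)
```

With `scales = "free_x"` the isotope values and the TOC values each get a suitable x range, while the shared Height axis keeps the panels comparable.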
Discern patterns (or signals) from noise.
Exploration, not confirmation or formal inference!
Wickham and Grolemund (2016)
ggplot(data = bonenburg_cross, mapping = aes(x = value, y = del13Ctoc)) + geom_point(aes(colour = Strat)) + facet_wrap(facets = vars(measurement), scales = "free")
ggplot(data = bonenburg_cross, mapping = aes(x = value, y = del13Ctoc)) + geom_point(aes(colour = Strat)) + geom_smooth() + facet_wrap(facets = vars(measurement), scales = "free")
ggplot( bonenburg, aes(x = TOCcfb, y = del13Ctoc) ) + geom_point() + geom_smooth(method = "lm")
ggplot( bonenburg, aes(x = log(TOCcfb), y = del13Ctoc) ) + geom_point() + geom_smooth(method = "lm")
library(modelr)
mod <- lm(del13Ctoc ~ log(TOCcfb), data = bonenburg)
ggplot(data = add_residuals(bonenburg, mod)) +
  geom_point(mapping = aes(x = K_Al, y = resid, color = Strat))
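What `add_residuals()` adds can be checked on simulated data. This is a sketch under stated assumptions: the data frame `toy`, the slope of -0.8 and the noise level are all invented, and only the model formula follows the slide.

```r
library(modelr)

# Hypothetical data: isotope values that depend on log(TOC) plus noise
set.seed(1)
toy <- data.frame(TOCcfb = c(0.3, 0.5, 1, 1.5, 2, 3, 4, 5))
toy$del13Ctoc <- -27 - 0.8 * log(toy$TOCcfb) + rnorm(nrow(toy), sd = 0.1)

# Fit the same kind of model as above
mod <- lm(del13Ctoc ~ log(TOCcfb), data = toy)

# add_residuals() appends a column `resid` (observed minus predicted)
toy_resid <- add_residuals(toy, mod)

# The same numbers by hand, using base R
by_hand <- toy$del13Ctoc - as.numeric(predict(mod, toy))
isTRUE(all.equal(as.numeric(toy_resid$resid), by_hand))
```

Plotting `resid` against a variable that is not in the model, as done with K_Al above, is a quick way to look for structure the model has not captured.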
This was a very simple, exploratory analysis of the data.
Fitting models:
Further reading:
Wickham and Grolemund (2016)
Load your data into R with the readr package:
- read_csv(): comma-separated (CSV) files
- read_tsv(): tab-separated files
- read_delim(): general delimited files
Description on the website: “In many cases, these functions will just work!”
Conversely, you can write data back to several file formats with the write_*() functions.
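A short round trip illustrates reading and writing with readr. The table `samples` is hypothetical and written to a temporary file, so nothing in a project directory is touched; declaring `col_types` is optional but avoids the column-guessing message.

```r
library(readr)

# Hypothetical two-row table (invented values)
samples <- data.frame(SampleID = c(0, 60), Height = c(3.01, 3.56))

# Write to a temporary CSV file, then read it back
path <- tempfile(fileext = ".csv")
write_csv(samples, path)
samples_back <- read_csv(
  path,
  col_types = cols(SampleID = col_double(), Height = col_double())
)

# The values should survive the round trip
identical(samples_back$Height, samples$Height)
```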
PAGES_example()
## [1] "bonenburg_raw.csv" "kuhjoch_raw.csv"
read_csv(PAGES_example("bonenburg_raw.csv"))
## # A tibble: 108 x 13
##    SampleID Height CaCO3    TN del13Ctoc TOCcfb `Al2O3 (%)` `Na2O (%)` `K2O (%)`
##       <dbl>  <dbl> <dbl> <dbl>     <dbl>  <dbl>       <dbl>      <dbl>     <dbl>
##  1        0   3.01 13.3   0.06     -27.5   1.16        15.6       0.62      3.25
##  2       60   3.56  2.67 NA        -25.5   0.27        13.1       1.27      3.33
##  3      100   3.95  3.84  0.07     -27.3   0.96        17.4       0.55      3.54
##  4      150   4.43  5.86  0.07     -27     1.25        17.6       0.44      3.79
##  5      200   4.94 12.8   0.07     -27.8   1.52        16.4       0.48      3.73
##  6      250   5.25  3.34  0.09     -27.6   2.45        14.6       0.61      3.42
##  7      275   5.68  9.91  0.06     -27     1.19        17.3       0.44      4.18
##  8      300   5.92 NA    NA         NA    NA           15.5       0.46      3.74
##  9      300   6.16 22.2   0.06     -27.1   1.21        NA         NA        NA
## 10      350   6.41 20.5   0.06     -27.5   1.14        15.1       0.51      3.92
## # … with 98 more rows, and 4 more variables: Strat <chr>, Strat2 <chr>,
## #   Section <chr>, Reference <chr>
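There are three interrelated rules which make a dataset tidy:
- Each variable must have its own column.
- Each observation must have its own row.
- Each value must have its own cell.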
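In practice, tidying often means reshaping a wide table into a long one with tidyr. The sketch below uses a hypothetical table `section_wide` with invented values; the output column names `measurement` and `value` mirror those of bonenburg_long, but how the package actually builds that dataset is not shown here, so treat this as an assumption.

```r
library(tidyr)

# Hypothetical wide table: one column per measurement (invented values)
section_wide <- data.frame(
  Height    = c(1, 2, 3),
  del13Ctoc = c(-27.5, -25.5, -27.3),
  TOCcfb    = c(1.16, 0.27, 0.96)
)

# Long format: one row per combination of Height and measurement
section_long <- pivot_longer(
  section_wide,
  cols      = c(del13Ctoc, TOCcfb),
  names_to  = "measurement",
  values_to = "value"
)

section_long
```

The long table has one row per height and measurement, which is the shape needed when a measurement name is used as a faceting variable.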
Create, rename, and reorder variables, and summarise data, with dplyr from the tidyverse:
- mutate() adds new variables that are functions of existing variables.
- select() picks variables based on their names.
- filter() picks cases based on their values.
- summarise() reduces multiple values down to a single summary.
- arrange() changes the ordering of the rows.
- Grouping: group_by() or rowwise()
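The verbs listed above, chained on a small example. The tibble `section` is a hypothetical stand-in for bonenburg with invented values, and `log_TOC` and `mean_d13C` are names made up for this sketch.

```r
library(dplyr)

# Hypothetical stand-in for bonenburg (invented values)
section <- tibble(
  Height    = c(1, 2, 3, 4),
  Strat     = c("Triassic", "Triassic", "Jurassic", "Jurassic"),
  del13Ctoc = c(-27.5, -27.1, -25.9, -26.3),
  TOCcfb    = c(1.2, 0.9, 0.3, 2.4)
)

result <- section %>%
  mutate(log_TOC = log(TOCcfb)) %>%               # add a derived variable
  select(Height, Strat, del13Ctoc, log_TOC) %>%   # keep columns by name
  filter(Height > 1) %>%                          # keep rows by value
  arrange(desc(Height)) %>%                       # reorder rows
  group_by(Strat) %>%                             # define groups
  summarise(mean_d13C = mean(del13Ctoc))          # one row per group

result
```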
XRF oxides and normalization with elemental ratios
mutate(
  bonenburg_tidy,
  # oxide correction: mass fraction of the element in its oxide
  Al_pc = Al2O3_pc * with(marelac::atomicweight, 2 * Al / (2 * Al + 3 * O)),
  # Na2O and K2O contain two cations and one oxygen
  Na_pc = Na2O_pc * with(marelac::atomicweight, 2 * Na / (2 * Na + O)),
  K_pc = K2O_pc * with(marelac::atomicweight, 2 * K / (2 * K + O)),
  .keep = "unused"
) %>%
  # normalization with Al and rename (Na_pc -> Na_Al, K_pc -> K_Al)
  mutate(
    across(c(Na_pc, K_pc), ~ .x / Al_pc, .names = "{gsub(\"pc\", \"\", .col)}Al"),
    .keep = "unused"
  )
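The oxide correction is plain stoichiometry: the element's share of the oxide's molar mass. As a worked sketch in base R, with atomic weights typed in by hand (rounded standard values, an assumption made here instead of calling marelac) and the Al2O3 and K2O percentages of the first sample in the raw data:

```r
# Rounded standard atomic weights (g/mol), entered by hand for this sketch
aw <- c(Al = 26.982, Na = 22.990, K = 39.098, O = 15.999)

# Mass fraction of the element in its oxide
f_Al <- unname(2 * aw["Al"] / (2 * aw["Al"] + 3 * aw["O"]))  # Al2O3
f_Na <- unname(2 * aw["Na"] / (2 * aw["Na"] + aw["O"]))      # Na2O
f_K  <- unname(2 * aw["K"]  / (2 * aw["K"]  + aw["O"]))      # K2O

# Roughly 0.529, 0.742 and 0.830
round(c(Al = f_Al, Na = f_Na, K = f_K), 3)

# Example: 15.6 % Al2O3 and 3.25 % K2O
Al_pc <- 15.6 * f_Al   # weight percent Al
K_pc  <- 3.25 * f_K    # weight percent K
K_Al  <- K_pc / Al_pc  # Al-normalised ratio
K_Al
```

Normalising to Al removes the effect of dilution by phases that contain neither element, which is why the ratio rather than the raw percentage is carried forward.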
Data
Lazy-loaded data: bonenburg and kuhjoch, as well as the long formats bonenburg_long and kuhjoch_long
Raw data: PAGES_example()
Examples
- project: vignette("project", package = "PAGES")
- explore: vignette("explore", package = "PAGES")
- model: vignette("model", package = "PAGES")
- wrangle: vignette("wrangle", package = "PAGES")
Slides: render_slides()
Fox, John, and Sanford Weisberg. 2018. An R companion to applied regression. Sage publications.
Schobben, Martin, Julia Gravendyck, Franziska Mangels, Ulrich Struck, Robert Bussert, Wolfram M. Kürschner, Dieter Korn, P. Martin Sander, and Martin Aberhan. 2019. “A comparative study of total organic carbon-\(\delta^{13}\)C signatures in the Triassic–Jurassic transitional beds of the Central European Basin and western Tethys shelf seas.” Newsletters on Stratigraphy 52 (4): 461–86. https://doi.org/10.1127/nos/2019/0499.
Wickham, Hadley, and Garrett Grolemund. 2016. R for data science: import, tidy, transform, visualize, and model data. O’Reilly Media, Inc. https://r4ds.had.co.nz/index.html.
Wilkinson, Leland, Graham Wills, D. Rope, Andrew Norton, and Roger Dubbs. 2005. The Grammar of Graphics (Statistics and Computing).
Zuur, Alain F., Elena N. Ieno, Neil J. Walker, Anatoly A. Saveliev, and Graham M. Smith. 2008. Mixed Effects Models and Extensions in Ecology with R. https://doi.org/10.4324/9780429201271-2.